{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0d7688a7",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/output_parsing/table_qa.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "530c973e-916d-4c9e-9365-e2d5306d7e3d",
   "metadata": {},
   "source": [
    "# Tables QA program\n",
    "\n",
    "In this example, we show how to perform table question answering with three baseline approaches:\n",
    "\n",
    "1. Feed the raw text output (the whole page from the PDF parser) to a query engine.\n",
    "2. Use recursive retrieval to fetch the nodes that contain the relevant tables.\n",
    "3. Generate a dataframe for each table and use a `PandasQueryEngine` per table as the nodes for the recursive retriever."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "81e5dde0",
   "metadata": {},
   "source": [
    "If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2833cea",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a308c3b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "OPENAI_API_TOKEN = \"sk-\"  # Your OpenAI API token here\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_TOKEN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0c0d1f2",
   "metadata": {},
   "source": [
    "### Loading Raw Text from PDF parser as document nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f1a703",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from llama_index.llms import MockLLM\n",
    "from llama_index.node_parser import (\n",
    "    MarkdownElementNodeParser,\n",
    ")\n",
    "from llama_index.schema import Document, IndexNode, TextNode\n",
    "\n",
    "\n",
    "test_table_document = Document(\n",
    "    text=\"\"\"\n",
    "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
    "|---|---|---|---|---|---|---|---|---|\n",
    "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
    "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
    "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
    "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
    "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
    "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
    "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
    "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
    "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
    "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
    "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
    "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
    "\n",
    "    Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
    "\n",
    "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
    "\n",
    "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
    "\n",
    "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
    "\n",
    "    We also analysed the potential data contamination and share the details in Section A.6.\n",
    "\n",
    "|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|\n",
    "|---|---|---|---|---|---|\n",
    "|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|\n",
    "|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|\n",
    "|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|\n",
    "|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|\n",
    "|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|\n",
    "|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|\n",
    "\n",
    "    Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4 are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the PaLM-2-L are from Anil et al. (2023).\n",
    "\n",
    "    3 Fine-tuning\n",
    "\n",
    "    Llama 2-Chat is the result of several months of research and iterative applications of alignment techniques, including both instruction tuning and RLHF, requiring significant computational and annotation resources. In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns (Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.\n",
    "\n",
    "    \n",
    "    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models\n",
    "    benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model\n",
    "    ---\n",
    "    # Statistics of human preference data for reward modeling\n",
    "\n",
    "    The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment and improved helpfulness and safety.\n",
    "\n",
    "    Others have found that helpfulness and safety sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, we train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM).\n",
    "\n",
    "    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model\n",
    "\n",
    "|Dataset|Num. of Comparisons|Avg. # Turns per Dialogue|Avg. # Tokens per Example|Avg. # Tokens in Prompt|Avg. # Tokens in Response|\n",
    "|---|---|---|---|---|---|\n",
    "|Anthropic Helpful|122,387|3.0|251.5|17.7|88.4|\n",
    "|Anthropic Harmless|43,966|3.0|152.5|15.7|46.4|\n",
    "|OpenAI Summarize|176,625|1.0|371.1|336.0|35.1|\n",
    "|OpenAI WebGPT|13,333|1.0|237.2|48.3|188.9|\n",
    "|StackExchange|1,038,480|1.0|440.2|200.1|240.2|\n",
    "|Stanford SHP|74,882|1.0|338.3|199.5|138.8|\n",
    "|Synthetic GPT-J|33,139|1.0|123.3|13.0|110.3|\n",
    "|Meta (Safety & Helpfulness)|1,418,091|3.9|798.5|31.4|234.1|\n",
    "|Total|2,919,326|1.6|595.7|108.2|216.9|\n",
    "\n",
    "    Table 6: Statistics of human preference data for reward modeling. We list both the open-source and internally collected human preference data used for reward modeling. Note that a binary human preference comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue). Each example consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model. We report the number of comparisons, the average number of turns per dialogue, the average number of tokens per example, per prompt and per response. More details on Meta helpfulness and safety data per batch can be found in Appendix A.3.1.\n",
    "\n",
    "    knows. This prevents cases where, for instance, the two models would have an information mismatch, which could result in favoring hallucinations. The model architecture and hyper-parameters are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.\n",
    "\n",
    "    Training Objectives. To train the reward model, we convert our collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart. We used a binary ranking loss consistent with Ouyang et al. (2022):\n",
    "\n",
    "    Lranking = −log(σ(r✓(x, yc) − r✓(x, yr))) (1)\n",
    "\n",
    "    where r✓(x, y) is the scalar score output for prompt x and completion y with model weights ✓. yc is the preferred response that annotators choose and yr is the rejected counterpart.\n",
    "\n",
    "    Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety reward models as follows. Given that our preference ratings is decomposed as a scale of four points (e.g., significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly teach the reward model to assign more discrepant scores to the generations that have more differences. To do so, we further add a margin component in the loss:\n",
    "        \"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e33aa3a6",
   "metadata": {},
   "source": [
    "## Baseline 1: Using PDF parser Raw Output (containing multiple tables) as Input for Query Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63705eba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The context information does not provide the performance of MPT 30B specifically for common sense reasoning.\n",
      "PaLM-2-L performance for TriviaQA is 86.1.\n",
      "Llama 2's performance for HumanEval is 29.9.\n",
      "Llama 2's performance for AGI Eval is not mentioned in the given context.\n",
      "LLAMA 2's performance for HumanEval and AGI Eval is not mentioned in the given context.\n"
     ]
    }
   ],
   "source": [
    "from llama_index import (\n",
    "    VectorStoreIndex,\n",
    "    get_response_synthesizer,\n",
    ")\n",
    "from llama_index.retrievers import VectorIndexRetriever\n",
    "from llama_index.query_engine import RetrieverQueryEngine\n",
    "\n",
    "# build index\n",
    "index = VectorStoreIndex.from_documents([test_table_document])\n",
    "\n",
    "# configure retriever\n",
    "retriever = VectorIndexRetriever(\n",
    "    index=index,\n",
    "    similarity_top_k=2,\n",
    ")\n",
    "# assemble query engine\n",
    "query_engine = RetrieverQueryEngine(\n",
    "    retriever=retriever,\n",
    ")\n",
    "\n",
    "# query different questions\n",
    "response_1 = query_engine.query(\n",
    "    \"What is MPT 30b performance for common sense reasoning?\"\n",
    ")\n",
    "print(response_1)\n",
    "\n",
    "response_2 = query_engine.query(\"What is PaLM-2-L performance for TriviaQA?\")\n",
    "print(response_2)\n",
    "\n",
    "\n",
    "response_3 = query_engine.query(\"What is LLAMA 2 performance for HumanEval?\")\n",
    "print(response_3)\n",
    "\n",
    "response_4 = query_engine.query(\"What is LLAMA 2 performance for AGI Eval?\")\n",
    "print(response_4)\n",
    "\n",
    "response_5 = query_engine.query(\n",
    "    \"What is LLAMA 2 performance for HumanEval and AGI Eval?\"\n",
    ")\n",
    "print(response_5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f395509f",
   "metadata": {},
   "source": [
    "Observation: Baseline 1 failed to answer Questions 1, 4, and 5 correctly; in each case the response says the value is not in the retrieved context."
   ]
  },
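  {
   "cell_type": "markdown",
   "id": "9b3e5c1a",
   "metadata": {},
   "source": [
    "The expected values for these questions can be read straight from the source tables. The sketch below shows one way to look up ground-truth cells for comparison. It is a minimal illustration under stated assumptions: `parse_markdown_table` is a hypothetical helper (not part of LlamaIndex), and `table_lines` holds only a subset of rows and columns copied from the first table.\n",
    "\n",
    "```python\n",
    "def parse_markdown_table(lines):\n",
    "    # Hypothetical helper: turn pipe-delimited table lines into row dicts.\n",
    "    header = [cell.strip() for cell in lines[0].strip('|').split('|')]\n",
    "    rows = []\n",
    "    for line in lines[2:]:  # skip the header row and the |---| separator\n",
    "        cells = [cell.strip() for cell in line.strip('|').split('|')]\n",
    "        rows.append(dict(zip(header, cells)))\n",
    "    return rows\n",
    "\n",
    "\n",
    "# Subset of rows and columns copied from the first table in the document.\n",
    "table_lines = [\n",
    "    '|Model|Size|Commonsense Reasoning|AGI Eval|',\n",
    "    '|---|---|---|---|',\n",
    "    '|MPT|30B|64.9|38.0|',\n",
    "    '|Llama 2|70B|71.9|51.2|',\n",
    "]\n",
    "\n",
    "rows = parse_markdown_table(table_lines)\n",
    "# Index rows by (model, size) so a single cell can be looked up directly.\n",
    "lookup = {(row['Model'], row['Size']): row for row in rows}\n",
    "print(lookup[('MPT', '30B')]['Commonsense Reasoning'])  # 64.9\n",
    "print(lookup[('Llama 2', '70B')]['AGI Eval'])  # 51.2\n",
    "```\n",
    "\n",
    "Both values are present in the document text, so a response that reports them as missing reflects a limitation of this baseline rather than of the source data."
   ]
  },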
  {
   "cell_type": "markdown",
   "id": "4a2b37fe",
   "metadata": {},
   "source": [
    "## Baseline 2: Use `MarkdownElementNodeParser` to parse table/text nodes and `RecursiveRetriever` to retrieve relevant nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2afe340",
   "metadata": {},
   "source": [
    "### Parsing nodes using `MarkdownElementNodeParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14403f74",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embeddings have been explicitly disabled. Using MockEmbedding.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "3it [00:19,  6.34s/it]\n"
     ]
    }
   ],
   "source": [
    "node_parser = MarkdownElementNodeParser()\n",
    "\n",
    "doc_nodes = node_parser.get_nodes_from_documents([test_table_document])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a8ee6da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9\n"
     ]
    }
   ],
   "source": [
    "print(len(doc_nodes))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "216dc5aa",
   "metadata": {},
   "source": [
    "### Get index nodes and child nodes mapping for recursive retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f824811c",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(doc_nodes)"
   ]
  },
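  {
   "cell_type": "markdown",
   "id": "4e7d2f60",
   "metadata": {},
   "source": [
    "The idea behind the mapping can be sketched with plain dictionaries: a short summary entry is matched against the query first, and its reference is then followed to the full node. This is a toy illustration only. `keyword_score`, `recursive_retrieve`, the summaries, and the ids `table_a` and `table_b` are hypothetical, and the keyword overlap stands in for the embedding similarity that a vector index would use; it is not how `RecursiveRetriever` is implemented.\n",
    "\n",
    "```python\n",
    "def keyword_score(query, text):\n",
    "    # Count query words that also appear in the text (a crude similarity).\n",
    "    query_words = set(query.lower().split())\n",
    "    return len(query_words & set(text.lower().split()))\n",
    "\n",
    "\n",
    "def recursive_retrieve(query, index_entries, node_dict):\n",
    "    # Step 1: pick the best-matching index entry by its summary text.\n",
    "    best = max(index_entries, key=lambda e: keyword_score(query, e['summary']))\n",
    "    # Step 2: follow its reference to the full node stored in the mapping.\n",
    "    return node_dict[best['ref']]\n",
    "\n",
    "\n",
    "# Hypothetical summaries that point at full table content by id.\n",
    "index_entries = [\n",
    "    {'summary': 'open-source model scores for commonsense reasoning and AGI Eval',\n",
    "     'ref': 'table_a'},\n",
    "    {'summary': 'closed-source benchmark scores for TriviaQA and HumanEval',\n",
    "     'ref': 'table_b'},\n",
    "]\n",
    "# One row from each table stands in for the full table node.\n",
    "node_dict = {\n",
    "    'table_a': '|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|',\n",
    "    'table_b': '|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|',\n",
    "}\n",
    "\n",
    "query = 'PaLM-2-L performance for TriviaQA'\n",
    "print(recursive_retrieve(query, index_entries, node_dict))\n",
    "```\n",
    "\n",
    "Matching on a short summary and then returning the full table keeps the retrieval step focused while still giving the LLM the complete table to read."
   ]
  },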
  {
   "cell_type": "markdown",
   "id": "7ae94e55",
   "metadata": {},
   "source": [
    "## Table Retrieval and Question Answering using Recursive Retrieval Query Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c31c7dd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers import RecursiveRetriever\n",
    "from llama_index.query_engine import RetrieverQueryEngine\n",
    "from llama_index import VectorStoreIndex\n",
    "from llama_index.embeddings import TogetherEmbedding, OpenAIEmbedding\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.service_context import ServiceContext\n",
    "\n",
    "# construct top-level vector index + query engine\n",
    "vector_index = VectorStoreIndex(doc_nodes)\n",
    "vector_retriever = vector_index.as_retriever(similarity_top_k=3)\n",
    "vector_query_engine = vector_index.as_query_engine(similarity_top_k=3)\n",
    "\n",
    "\n",
    "recursive_retriever = RecursiveRetriever(\n",
    "    \"vector\",\n",
    "    retriever_dict={\"vector\": vector_retriever},\n",
    "    node_dict=node_mappings,\n",
    "    verbose=True,\n",
    ")\n",
    "\n",
    "llm = OpenAI(temperature=0, model=\"gpt-4\")\n",
    "service_context = ServiceContext.from_defaults(llm=llm)\n",
    "\n",
    "recursive_query_engine = RetrieverQueryEngine.from_args(\n",
    "    recursive_retriever, service_context=service_context\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1edbf239",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;3;34mRetrieving with query id None: What is MPT 30b performance for common sense reasoning?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_1_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_1_table: What is MPT 30b performance for common sense reasoning?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, and BIG-Bench Hard. The table shows the performance of each model in different shot scenarios.,\n",
      "with the following table title:\n",
      "Benchmark Scores for Language Models,\n",
      "with the following columns:\n",
      "- Benchmark (shots): The tasks and shot scenarios used for benchmarking\n",
      "- GPT-3.5: The benchmark score for the GPT-3.5 model\n",
      "- GPT-4: The benchmark score for the GPT-4 model\n",
      "- PaLM: The benchmark score for the PaLM model\n",
      "- PaLM-2-L: The benchmark score for the PaLM-2-L model\n",
      "- Llama 2: The benchmark score for the Llama 2 model\n",
      "\n",
      "|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|\n",
      "|---|---|---|---|---|---|\n",
      "|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|\n",
      "|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|\n",
      "|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|\n",
      "|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|\n",
      "|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|\n",
      "|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|\n",
      "\n",
      "\u001b[0mThe performance of the MPT 30B model for the commonsense reasoning task is 64.9.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is PaLM-2-L performance for TriviaQA?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is PaLM-2-L performance for TriviaQA?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, and BIG-Bench Hard. The table shows the performance of each model in different shot scenarios.,\n",
      "with the following table title:\n",
      "Benchmark Scores for Language Models,\n",
      "with the following columns:\n",
      "- Benchmark (shots): The tasks and shot scenarios used for benchmarking\n",
      "- GPT-3.5: The benchmark score for the GPT-3.5 model\n",
      "- GPT-4: The benchmark score for the GPT-4 model\n",
      "- PaLM: The benchmark score for the PaLM model\n",
      "- PaLM-2-L: The benchmark score for the PaLM-2-L model\n",
      "- Llama 2: The benchmark score for the Llama 2 model\n",
      "\n",
      "|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|\n",
      "|---|---|---|---|---|---|\n",
      "|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|\n",
      "|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|\n",
      "|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|\n",
      "|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|\n",
      "|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|\n",
      "|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4 are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the PaLM-2-L are from Anil et al. (2023).\n",
      "\n",
      "    3 Fine-tuning\n",
      "\n",
      "    Llama 2-Chat is the result of several months of research and iterative applications of alignment techniques, including both instruction tuning and RLHF, requiring significant computational and annotation resources. In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns (Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.\n",
      "\n",
      "    \n",
      "    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models\n",
      "    benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model\n",
      "    ---\n",
      "    # Statistics of human preference data for reward modeling\n",
      "\n",
      "    The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment and improved helpfulness and safety.\n",
      "\n",
      "    Others have found that helpfulness and safety sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, we train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM).\n",
      "\n",
      "    We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model\n",
      "\u001b[0mThe PaLM-2-L model scored 86.1 on the TriviaQA benchmark.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for HumanEval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is LLAMA 2 performance for HumanEval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, and BIG-Bench Hard. The table shows the performance of each model in different shot scenarios.,\n",
      "with the following table title:\n",
      "Benchmark Scores for Language Models,\n",
      "with the following columns:\n",
      "- Benchmark (shots): The tasks and shot scenarios used for benchmarking\n",
      "- GPT-3.5: The benchmark score for the GPT-3.5 model\n",
      "- GPT-4: The benchmark score for the GPT-4 model\n",
      "- PaLM: The benchmark score for the PaLM model\n",
      "- PaLM-2-L: The benchmark score for the PaLM-2-L model\n",
      "- Llama 2: The benchmark score for the Llama 2 model\n",
      "\n",
      "|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|\n",
      "|---|---|---|---|---|---|\n",
      "|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|\n",
      "|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|\n",
      "|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|\n",
      "|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|\n",
      "|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|\n",
      "|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|\n",
      "\n",
      "\u001b[0mThe performance of the Llama 2 model for HumanEval (0-shot) is 29.9.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for AGI Eval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_1_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_1_table: What is LLAMA 2 performance for AGI Eval?\n",
      "\u001b[0mThe performance of the Llama 2 model for AGI Eval is 51.2.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for HumanEval and AGI Eval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is LLAMA 2 performance for HumanEval and AGI Eval?\n",
      "\u001b[0mThe Llama 2 model scored 29.9 on the HumanEval (0-shot) benchmark. For the AGI Eval task, the Llama 2 70B model scored 51.2.\n"
     ]
    }
   ],
   "source": [
    "# query different questions\n",
    "response_1 = recursive_query_engine.query(\n",
    "    \"What is MPT 30b performance for common sense reasoning?\"\n",
    ")\n",
    "print(response_1)\n",
    "\n",
    "response_2 = recursive_query_engine.query(\n",
    "    \"What is PaLM-2-L performance for TriviaQA?\"\n",
    ")\n",
    "print(response_2)\n",
    "\n",
    "\n",
    "response_3 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for HumanEval?\"\n",
    ")\n",
    "print(response_3)\n",
    "\n",
    "response_4 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for AGI Eval?\"\n",
    ")\n",
    "print(response_4)\n",
    "\n",
    "response_5 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for HumanEval and AGI Eval?\"\n",
    ")\n",
    "print(response_5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d48d604",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The performance of the MPT 30B model for the commonsense reasoning task is 64.9.\n",
      "The PaLM-2-L model scored 86.1 on the TriviaQA benchmark.\n",
      "The performance of the Llama 2 model for HumanEval (0-shot) is 29.9.\n",
      "The performance of the Llama 2 model for AGI Eval is 51.2.\n",
      "The Llama 2 model scored 29.9 on the HumanEval (0-shot) benchmark. For the AGI Eval task, the Llama 2 70B model scored 51.2.\n"
     ]
    }
   ],
   "source": [
    "print(response_1)\n",
    "print(response_2)\n",
    "print(response_3)\n",
    "print(response_4)\n",
    "print(response_5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ade31bad",
   "metadata": {},
   "source": [
    "Observation: Baseline 2 answers all 5 questions. However, its answer to the 4th question is only partial: Llama 2 comes in several sizes, and `51.2` is the score for the Llama 2 70B model only."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "853f3af3",
   "metadata": {},
   "source": [
    "## Baseline 3: Recursive Retrieval + Pandas Query Engine for Table Nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fe9a5cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "import ast\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Collect one DataFrame and one summary per table node.\n",
    "table_dfs = []\n",
    "table_summaries = []\n",
    "for node in doc_nodes:\n",
    "    if \"table\" in node.id_:\n",
    "        if \"table_df\" in node.metadata:\n",
    "            # The metadata holds the table as a string literal;\n",
    "            # parse it safely before building the DataFrame.\n",
    "            table_dfs.append(\n",
    "                pd.DataFrame.from_dict(\n",
    "                    ast.literal_eval(node.metadata[\"table_df\"])\n",
    "                )\n",
    "            )\n",
    "        if \"table_summary\" in node.metadata:\n",
    "            table_summaries.append(node.metadata[\"table_summary\"])\n",
    "print(len(table_dfs))\n",
    "print(len(table_summaries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc1f6834",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dataset</th>\n",
       "      <th>Num. of Comparisons</th>\n",
       "      <th>Avg. # Turns per Dialogue</th>\n",
       "      <th>Avg. # Tokens per Example</th>\n",
       "      <th>Avg. # Tokens in Prompt</th>\n",
       "      <th>Avg. # Tokens in Response</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Anthropic Helpful</td>\n",
       "      <td>122,387</td>\n",
       "      <td>3.0</td>\n",
       "      <td>251.5</td>\n",
       "      <td>17.7</td>\n",
       "      <td>88.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Anthropic Harmless</td>\n",
       "      <td>43,966</td>\n",
       "      <td>3.0</td>\n",
       "      <td>152.5</td>\n",
       "      <td>15.7</td>\n",
       "      <td>46.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>OpenAI Summarize</td>\n",
       "      <td>176,625</td>\n",
       "      <td>1.0</td>\n",
       "      <td>371.1</td>\n",
       "      <td>336.0</td>\n",
       "      <td>35.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>OpenAI WebGPT</td>\n",
       "      <td>13,333</td>\n",
       "      <td>1.0</td>\n",
       "      <td>237.2</td>\n",
       "      <td>48.3</td>\n",
       "      <td>188.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>StackExchange</td>\n",
       "      <td>1,038,480</td>\n",
       "      <td>1.0</td>\n",
       "      <td>440.2</td>\n",
       "      <td>200.1</td>\n",
       "      <td>240.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Stanford SHP</td>\n",
       "      <td>74,882</td>\n",
       "      <td>1.0</td>\n",
       "      <td>338.3</td>\n",
       "      <td>199.5</td>\n",
       "      <td>138.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Synthetic GPT-J</td>\n",
       "      <td>33,139</td>\n",
       "      <td>1.0</td>\n",
       "      <td>123.3</td>\n",
       "      <td>13.0</td>\n",
       "      <td>110.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Meta (Safety &amp; Helpfulness)</td>\n",
       "      <td>1,418,091</td>\n",
       "      <td>3.9</td>\n",
       "      <td>798.5</td>\n",
       "      <td>31.4</td>\n",
       "      <td>234.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Total</td>\n",
       "      <td>2,919,326</td>\n",
       "      <td>1.6</td>\n",
       "      <td>595.7</td>\n",
       "      <td>108.2</td>\n",
       "      <td>216.9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       Dataset Num. of Comparisons  Avg. # Turns per Dialogue  \\\n",
       "0            Anthropic Helpful             122,387                        3.0   \n",
       "1           Anthropic Harmless              43,966                        3.0   \n",
       "2             OpenAI Summarize             176,625                        1.0   \n",
       "3                OpenAI WebGPT              13,333                        1.0   \n",
       "4                StackExchange           1,038,480                        1.0   \n",
       "5                 Stanford SHP              74,882                        1.0   \n",
       "6              Synthetic GPT-J              33,139                        1.0   \n",
       "7  Meta (Safety & Helpfulness)           1,418,091                        3.9   \n",
       "8                        Total           2,919,326                        1.6   \n",
       "\n",
       "   Avg. # Tokens per Example  Avg. # Tokens in Prompt  \\\n",
       "0                      251.5                     17.7   \n",
       "1                      152.5                     15.7   \n",
       "2                      371.1                    336.0   \n",
       "3                      237.2                     48.3   \n",
       "4                      440.2                    200.1   \n",
       "5                      338.3                    199.5   \n",
       "6                      123.3                     13.0   \n",
       "7                      798.5                     31.4   \n",
       "8                      595.7                    108.2   \n",
       "\n",
       "   Avg. # Tokens in Response  \n",
       "0                       88.4  \n",
       "1                       46.4  \n",
       "2                       35.1  \n",
       "3                      188.9  \n",
       "4                      240.2  \n",
       "5                      138.8  \n",
       "6                      110.3  \n",
       "7                      234.1  \n",
       "8                      216.9  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table_dfs[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec047384",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\\nwith the following table title:\\nModel Performance Comparison,\\nwith the following columns:\\n- Model: The name of the model\\n- Size: The size of the model\\n- Code: The code score for the model\\n- Commonsense Reasoning: The score for commonsense reasoning task\\n- World Knowledge: The score for world knowledge task\\n- Reading Comprehension: The score for reading comprehension task\\n- Math MMLU: The score for math MMLU task\\n- BBH: The score for BBH task\\n- AGI Eval: The score for AGI evaluation task\\n',\n",
       " 'This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, and BIG-Bench Hard. The table shows the performance of each model in different shot scenarios.,\\nwith the following table title:\\nBenchmark Scores for Language Models,\\nwith the following columns:\\n- Benchmark (shots): The tasks and shot scenarios used for benchmarking\\n- GPT-3.5: The benchmark score for the GPT-3.5 model\\n- GPT-4: The benchmark score for the GPT-4 model\\n- PaLM: The benchmark score for the PaLM model\\n- PaLM-2-L: The benchmark score for the PaLM-2-L model\\n- Llama 2: The benchmark score for the Llama 2 model\\n',\n",
       " 'This table provides statistics on various datasets, including the number of comparisons, average number of turns per dialogue, average number of tokens per example, average number of tokens in the prompt, and average number of tokens in the response.,\\nwith the following table title:\\nDataset Statistics,\\nwith the following columns:\\n- Dataset: The name of the dataset\\n- Num. of Comparisons: The number of comparisons in the dataset\\n- Avg. # Turns per Dialogue: The average number of turns per dialogue in the dataset\\n- Avg. # Tokens per Example: The average number of tokens per example in the dataset\\n- Avg. # Tokens in Prompt: The average number of tokens in the prompt in the dataset\\n- Avg. # Tokens in Response: The average number of tokens in the response in the dataset\\n']"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table_summaries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "054ffd3c",
   "metadata": {},
   "source": [
    "## Create Pandas Query Engines\n",
    "\n",
    "We create a `PandasQueryEngine` over each structured table (a pandas `DataFrame`).\n",
    "\n",
    "Each engine can also be queried on its own to answer questions about its table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2646e701",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "from llama_index.query_engine import PandasQueryEngine\n",
    "from llama_index.schema import IndexNode\n",
    "\n",
    "# Build one PandasQueryEngine per table DataFrame.\n",
    "df_query_engines = [\n",
    "    PandasQueryEngine(table_df, service_context=service_context)\n",
    "    for table_df in table_dfs\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98790b6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# define index nodes\n",
    "\n",
    "df_index_nodes = [\n",
    "    IndexNode(text=summary, index_id=f\"pandas{idx}\")\n",
    "    for idx, summary in enumerate(table_summaries)\n",
    "]\n",
    "\n",
    "df_id_query_engine_mapping = {\n",
    "    f\"pandas{idx}\": df_query_engine\n",
    "    for idx, df_query_engine in enumerate(df_query_engines)\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a01216d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# construct top-level vector index + retriever\n",
    "vector_index = VectorStoreIndex(doc_nodes + df_index_nodes)\n",
    "vector_retriever = vector_index.as_retriever(similarity_top_k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6e1a8e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers import RecursiveRetriever\n",
    "from llama_index.query_engine import RetrieverQueryEngine\n",
    "from llama_index.response_synthesizers import get_response_synthesizer\n",
    "\n",
    "# dedup_nodes = [node for node in doc_nodes if not isinstance(node, IndexNode)]\n",
    "\n",
    "\n",
    "recursive_retriever = RecursiveRetriever(\n",
    "    \"vector\",\n",
    "    retriever_dict={\"vector\": vector_retriever},\n",
    "    query_engine_dict=df_id_query_engine_mapping,\n",
    "    node_dict=node_mappings,\n",
    "    verbose=True,\n",
    ")\n",
    "\n",
    "response_synthesizer = get_response_synthesizer(\n",
    "    service_context=service_context, response_mode=\"compact\"\n",
    ")\n",
    "\n",
    "recursive_query_engine = RetrieverQueryEngine.from_args(\n",
    "    recursive_retriever, response_synthesizer=response_synthesizer\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7268f3ef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;3;34mRetrieving with query id None: What is MPT 30b performance for common sense reasoning?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: pandas0\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id pandas0: What is MPT 30b performance for common sense reasoning?\n",
      "\u001b[0m\u001b[1;3;32mGot response: 64.9\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_1_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_1_table: What is MPT 30b performance for common sense reasoning?\n",
      "\u001b[0mThe performance of MPT 30B for common sense reasoning is 64.9.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is PaLM-2-L performance for TriviaQA?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: pandas1\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id pandas1: What is PaLM-2-L performance for TriviaQA?\n",
      "\u001b[0m\u001b[1;3;32mGot response: 86.1\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is PaLM-2-L performance for TriviaQA?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides benchmark scores for different language models, including GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2. The scores are measured in various tasks such as MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, and BIG-Bench Hard. The table shows the performance of each model in different shot scenarios.,\n",
      "with the following table title:\n",
      "Benchmark Scores for Language Models,\n",
      "with the following columns:\n",
      "- Benchmark (shots): The tasks and shot scenarios used for benchmarking\n",
      "- GPT-3.5: The benchmark score for the GPT-3.5 model\n",
      "- GPT-4: The benchmark score for the GPT-4 model\n",
      "- PaLM: The benchmark score for the PaLM model\n",
      "- PaLM-2-L: The benchmark score for the PaLM-2-L model\n",
      "- Llama 2: The benchmark score for the Llama 2 model\n",
      "\n",
      "|Benchmark (shots)|GPT-3.5|GPT-4|PaLM|PaLM-2-L|Llama 2|\n",
      "|---|---|---|---|---|---|\n",
      "|MMLU (5-shot)|70.0|86.4|69.3|78.3|68.9|\n",
      "|TriviaQA (1-shot)|–|–|81.4|86.1|85.0|\n",
      "|Natural Questions (1-shot)|–|–|29.3|37.5|33.0|\n",
      "|GSM8K (8-shot)|57.1|92.0|56.5|80.7|56.8|\n",
      "|HumanEval (0-shot)|48.1|67.0|26.2|–|29.9|\n",
      "|BIG-Bench Hard (3-shot)|–|–|52.3|65.7|51.2|\n",
      "\n",
      "\u001b[0mThe performance of PaLM-2-L for TriviaQA is 86.1.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for HumanEval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: pandas1\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id pandas1: What is LLAMA 2 performance for HumanEval?\n",
      "\u001b[0m\u001b[1;3;32mGot response: 29.9\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is LLAMA 2 performance for HumanEval?\n",
      "\u001b[0mThe performance of the Llama 2 model for HumanEval is 29.9.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for AGI Eval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_1_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_1_table: What is LLAMA 2 performance for AGI Eval?\n",
      "\u001b[0mThe performance of the Llama 2 model for the AGI Eval task is 51.2.\n",
      "\u001b[1;3;34mRetrieving with query id None: What is LLAMA 2 performance for HumanEval and AGI Eval?\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.\n",
      "\n",
      "    • Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\n",
      "\n",
      "    As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by ⇡5 and ⇡8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.\n",
      "\n",
      "    In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.\n",
      "\n",
      "    We also analysed the potential data contamination and share the details in Section A.6.\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieving text node: This table provides information on different models and their performance in various tasks such as commonsense reasoning, world knowledge, reading comprehension, math MMLU, BBH, and AGI evaluation. The table includes the model names, sizes, and scores for each task.,\n",
      "with the following table title:\n",
      "Model Performance Comparison,\n",
      "with the following columns:\n",
      "- Model: The name of the model\n",
      "- Size: The size of the model\n",
      "- Code: The code score for the model\n",
      "- Commonsense Reasoning: The score for commonsense reasoning task\n",
      "- World Knowledge: The score for world knowledge task\n",
      "- Reading Comprehension: The score for reading comprehension task\n",
      "- Math MMLU: The score for math MMLU task\n",
      "- BBH: The score for BBH task\n",
      "- AGI Eval: The score for AGI evaluation task\n",
      "\n",
      "|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math MMLU|BBH|AGI Eval|\n",
      "|---|---|---|---|---|---|---|---|---|\n",
      "|MPT|7B|20.5|57.4|41.0|57.5|4.9|26.8|31.0|\n",
      "|MPT|30B|28.9|64.9|50.0|64.7|9.1|46.9|38.0|\n",
      "|Falcon|7B|5.6|56.1|42.8|36.0|4.6|26.2|28.0|\n",
      "|Falcon|40B|15.2|69.2|56.7|65.7|12.6|55.4|37.1|\n",
      "|Falcon|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|\n",
      "|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|\n",
      "|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|\n",
      "|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|\n",
      "|Llama 1|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|\n",
      "|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|\n",
      "|Llama 2|34B|27.8|69.9|58.7|68.0|24.2|62.6|44.1|\n",
      "|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;5;200mRetrieved node with id, entering: id_3_table\n",
      "\u001b[0m\u001b[1;3;34mRetrieving with query id id_3_table: What is LLAMA 2 performance for HumanEval and AGI Eval?\n",
      "\u001b[0mThe performance of the Llama 2 model for the HumanEval task is 29.9, while for the AGI Eval task, it scores 51.2.\n"
     ]
    }
   ],
   "source": [
    "# query different questions\n",
    "response_1 = recursive_query_engine.query(\n",
    "    \"What is MPT 30b performance for common sense reasoning?\"\n",
    ")\n",
    "print(response_1)\n",
    "\n",
    "response_2 = recursive_query_engine.query(\n",
    "    \"What is PaLM-2-L performance for TriviaQA?\"\n",
    ")\n",
    "print(response_2)\n",
    "\n",
    "response_3 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for HumanEval?\"\n",
    ")\n",
    "print(response_3)\n",
    "\n",
    "response_4 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for AGI Eval?\"\n",
    ")\n",
    "print(response_4)\n",
    "\n",
    "response_5 = recursive_query_engine.query(\n",
    "    \"What is LLAMA 2 performance for HumanEval and AGI Eval?\"\n",
    ")\n",
    "print(response_5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b09db17",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The performance of MPT 30B for common sense reasoning is 64.9.\n",
      "The performance of PaLM-2-L for TriviaQA is 86.1.\n",
      "The performance of LLAMA 2 for HumanEval is 29.9.\n",
      "The performance of the Llama 2 model for the AGI Eval task varies depending on the size of the model. The Llama 2 13B model scored 39.4, the Llama 2 34B model scored 44.1, and the Llama 2 70B model scored 51.2.\n",
      "The Llama 2 model scored 29.9 on the HumanEval task and 51.2 on the AGI Eval task.\n"
     ]
    }
   ],
   "source": [
    "print(response_1)\n",
    "print(response_2)\n",
    "print(response_3)\n",
    "print(response_4)\n",
    "print(response_5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79874a1a",
   "metadata": {},
   "source": [
    "Observation: The Baseline 3 approach answers all five questions. For the fourth question, it reports the AGI Eval score for every Llama 2 model size (13B, 34B, and 70B)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index-vs8PXMh0-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
